Data Preparation

Data preparation means converting raw data into a format that a machine learning model can use.

Raw data is usually designed for business systems, logs, or analytics. It is not usually ready for model training.

For this beginner example, we will use mock GA4-style event data.

Mock Data Source

The sample data is generated by this script:

generate_ga4_data.py

The generated file is:

ga4_mock_data.json

This file uses JSON Lines format.

That means:

one line = one JSON object = one event
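To make the format concrete, here is a minimal sketch that parses JSON Lines text with only the standard library. The two events and the helper name read_json_lines are made up for illustration; they are not part of generate_ga4_data.py.

```python
import io
import json

# Two made-up events, one JSON object per line, as in a JSON Lines file.
sample_text = (
    '{"event_name": "session_start", "event_date": "20230101"}\n'
    '{"event_name": "view_item", "event_date": "20230101"}\n'
)


def read_json_lines(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]


events = read_json_lines(io.StringIO(sample_text))
print(len(events))              # number of events parsed
print(events[1]["event_name"])  # name of the second event
```

Because every line is a complete object, the file can be processed line by line without loading it all at once.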

This page uses only the generated mock GA4 event data. Later lessons continue to use Python, pandas, and scikit-learn.

Partial Sample Data

The raw data looks like this (pretty-printed here for readability; in the file, each event occupies a single line):

{
  "event_date": "20230101",
  "event_timestamp": 1672574760000000,
  "event_name": "session_start",
  "user_pseudo_id": "f282d77c-aef5-4089-b73b-3d5b00c914f2",
  "event_params": [
    {
      "key": "ga_session_id",
      "value": {
        "int_value": 1780749951
      }
    },
    {
      "key": "page_title",
      "value": {
        "string_value": "Home Page"
      }
    }
  ]
}

Another event may contain product information:

{
  "event_date": "20230101",
  "event_timestamp": 1672574763973844,
  "event_name": "view_item",
  "user_pseudo_id": "f282d77c-aef5-4089-b73b-3d5b00c914f2",
  "event_params": [
    {
      "key": "ga_session_id",
      "value": {
        "int_value": 1780749951
      }
    },
    {
      "key": "page_title",
      "value": {
        "string_value": "Home Page"
      }
    },
    {
      "key": "item_id",
      "value": {
        "string_value": "SKU_4001"
      }
    },
    {
      "key": "price",
      "value": {
        "double_value": 45.0
      }
    }
  ]
}

The important idea is:

raw event data is nested

For beginner machine learning practice, we will flatten the useful fields into normal columns.

Load Data With pandas

Use read_json with lines=True.

import pandas as pd

raw_events = pd.read_json("ga4_mock_data.json", lines=True)
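If the generated file is not at hand, the same call can be tried on an in-memory string. This is a small sketch with two shortened, made-up event lines; the variable names sample_text and sample_events are assumptions for illustration.

```python
import io

import pandas as pd

# Two shortened, made-up events in JSON Lines form (one object per line).
sample_text = (
    '{"event_date": "20230101", "event_name": "session_start", '
    '"event_params": [{"key": "ga_session_id", "value": {"int_value": 1780749951}}]}\n'
    '{"event_date": "20230101", "event_name": "view_item", '
    '"event_params": [{"key": "item_id", "value": {"string_value": "SKU_4001"}}]}\n'
)

# lines=True tells pandas to treat each line as a separate JSON object.
sample_events = pd.read_json(io.StringIO(sample_text), lines=True)

print(sample_events.shape)                         # (rows, columns)
print(type(sample_events.loc[0, "event_params"]))  # nested params stay a Python list
```

The nested event_params column is not expanded automatically; each cell still holds a list of key-value objects. It is also worth checking the dtypes after loading, since pandas may infer a numeric type for digit-only strings such as event_date.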

Preview the raw data.

raw_events.head()

Check the columns.

raw_events.columns

Expected columns:

  • event_date
  • event_timestamp
  • event_name
  • user_pseudo_id
  • event_params

Flatten Event Parameters

The event_params column contains a list of key-value objects.

For example, ga_session_id, item_id, and price are inside event_params.

Create a helper function to extract one parameter.

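def get_event_param(event_params, target_key):
    # Scan the list until the entry with the requested key is found.
    for param in event_params:
        if param["key"] != target_key:
            continue

        value = param["value"]

        # Return whichever typed field is present on this entry.
        for value_type in ["int_value", "double_value", "string_value"]:
            if value_type in value:
                return value[value_type]

    # The key is absent (or carries no known value type).
    return None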
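The helper above scans the list once per parameter. When several parameters are needed from the same event, one alternative is to collapse the whole list into a dictionary in a single pass. The sketch below uses a hypothetical helper, params_to_dict, which is not part of the lesson's code; it also skips typed fields whose value is null, which is an assumption about how the data might look.

```python
def params_to_dict(event_params):
    """Collapse a GA4-style list of key/value objects into a plain dict."""
    flat = {}
    for param in event_params:
        value = param.get("value", {})
        # Take the first typed field that is present and not null.
        for value_type in ("int_value", "double_value", "string_value"):
            if value.get(value_type) is not None:
                flat[param["key"]] = value[value_type]
                break
    return flat


# Parameters taken from the view_item sample event shown earlier.
sample_params = [
    {"key": "ga_session_id", "value": {"int_value": 1780749951}},
    {"key": "item_id", "value": {"string_value": "SKU_4001"}},
    {"key": "price", "value": {"double_value": 45.0}},
]

flat = params_to_dict(sample_params)
print(flat["item_id"])         # value stored under string_value
print(flat.get("page_title"))  # None, because this key is absent here
```

Either approach gives the same values for the fields used on this page; the dictionary version mainly saves repeated scans.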

Create feature_events, a flatter DataFrame for beginner-friendly analysis.

feature_events = raw_events.copy()

feature_events["event_date"] = pd.to_datetime(
    feature_events["event_date"],
    format="%Y%m%d",
)
feature_events["user_id"] = feature_events["user_pseudo_id"]
feature_events["session_id"] = feature_events["event_params"].apply(
    lambda params: get_event_param(params, "ga_session_id")
)
feature_events["item_id"] = feature_events["event_params"].apply(
    lambda params: get_event_param(params, "item_id")
)
feature_events["price"] = feature_events["event_params"].apply(
    lambda params: get_event_param(params, "price")
)

feature_events = feature_events[
    [
        "event_date",
        "event_timestamp",
        "user_id",
        "session_id",
        "event_name",
        "item_id",
        "price",
    ]
]

Now feature_events is easier to use (event_timestamp is omitted here to save space):

event_date | user_id   | session_id | event_name    | item_id  | price
2023-01-01 | user uuid | 1780749951 | session_start |          |
2023-01-01 | user uuid | 1780749951 | view_item     | SKU_4001 | 45.0

This is still event-level data.

For machine learning, we will later convert it into user-level data.

many events -> one row per user

Basic Cleaning

Check the first few rows.

feature_events.head()

Check column types.

feature_events.dtypes

Check missing values.

feature_events.isna().sum()

Missing values are not always bad.

For example, a session_start event usually has no item_id or price. That is expected, because no product has been viewed yet.
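One way to tell expected gaps from real problems is to look at the share of missing values per event type. The sketch below uses a tiny made-up table (SKU_4002 is invented for illustration) rather than the real feature_events.

```python
import pandas as pd

# Made-up rows shaped like feature_events, reduced to three columns.
sample = pd.DataFrame(
    {
        "event_name": ["session_start", "view_item", "view_item"],
        "item_id": [None, "SKU_4001", "SKU_4002"],
        "price": [None, 45.0, None],
    }
)

# Share of missing values in each column, split by event type.
missing_share = (
    sample[["item_id", "price"]]
    .isna()
    .groupby(sample["event_name"])
    .mean()
)
print(missing_share)
```

In this toy table, session_start rows miss both fields, which is expected, while a missing price on a view_item row would deserve a closer look.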

Define The Problem First

Before writing more code, define the machine learning problem clearly.

For this lesson, the problem is:

Predict whether a user makes a purchase, based on simple summaries of that user's events.

This is a simplified classroom example.

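In real projects, we usually need time windows: features are built from events before a cutoff date and the label from events after it, so the model never sees information from the future. For now, we skip that because the first goal is to understand the basic data shape.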
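Although this lesson skips time windows, a rough sketch may help show what they involve. Everything here is assumed for illustration: the user IDs, the cutoff date, and the event name "purchase" are not taken from the mock data.

```python
import pandas as pd

# Made-up events for two users.
events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2"],
        "event_date": pd.to_datetime(
            ["2023-01-01", "2023-01-05", "2023-01-02", "2023-01-03"]
        ),
        "event_name": ["view_item", "purchase", "view_item", "view_item"],
    }
)

cutoff = pd.Timestamp("2023-01-04")  # hypothetical cutoff date

# Features may only use events before the cutoff; labels come from after it.
feature_window = events[events["event_date"] < cutoff]
label_window = events[events["event_date"] >= cutoff]

features = feature_window.groupby("user_id").size().rename("event_count")
buyers = set(
    label_window.loc[label_window["event_name"] == "purchase", "user_id"]
)
labels = features.index.to_series().isin(buyers).astype(int).rename("purchased")

print(pd.concat([features, labels], axis=1))
```

Keeping the two windows separate is what prevents the purchase event itself from leaking into the features.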

Unit Of Prediction

The unit of prediction answers:

What does one row mean?

For this example:

one row = one user

This is important because the model needs a stable meaning for each row.

Bad design:

one row = one event

This would make the label confusing because one user can have many events, while the purchase label describes the user as a whole.

Better design:

one row = one user with summarized behavior
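As a preview of that shape, the sketch below collapses a few made-up events into one row per user. The column names event_count, view_count, and has_purchase, and the event name "purchase", are assumptions for illustration; the next lesson may build the table differently.

```python
import pandas as pd

# Made-up event-level rows: three events for u1, one for u2.
events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "event_name": ["session_start", "view_item", "purchase", "session_start"],
    }
)

user_table = (
    events.assign(
        is_view=events["event_name"].eq("view_item"),
        is_purchase=events["event_name"].eq("purchase"),
    )
    .groupby("user_id")
    .agg(
        event_count=("event_name", "size"),  # all events for the user
        view_count=("is_view", "sum"),       # number of view_item events
        has_purchase=("is_purchase", "max"), # True if any purchase event
    )
)
user_table["has_purchase"] = user_table["has_purchase"].astype(int)

print(user_table)
```

Each row now has one stable meaning: a single user, described by counts and a single 0/1 label.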

Output For Feature Engineering

For this beginner example, we do not split events by time window.

The output of this lesson is:

feature_events

The next lesson will use feature_events to build a user-level feature table.

Basic Data Checks

Check event counts.

feature_events["event_name"].value_counts()

Check date range.

feature_events["event_date"].min(), feature_events["event_date"].max()

Check user count.

feature_events["user_id"].nunique()

Data Preparation Checklist

Before moving to feature engineering, confirm:

  • The business question is clear
  • The prediction target is clear
  • One row has a clear meaning
  • Raw data has enough rows
  • Raw data has the events needed to create features and labels
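The last two checklist items can be turned into a small automated check. This is only a sketch: the function check_ready, the row threshold, and the required event names are hypothetical and should be adapted to the actual problem.

```python
import pandas as pd


def check_ready(events, required_events, min_rows):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(events) < min_rows:
        problems.append(f"only {len(events)} rows, expected at least {min_rows}")
    missing = set(required_events) - set(events["event_name"].unique())
    if missing:
        problems.append(f"missing event types: {sorted(missing)}")
    return problems


# Made-up table that lacks the (assumed) "purchase" event needed for labels.
sample = pd.DataFrame({"event_name": ["session_start", "view_item"]})
problems = check_ready(
    sample, required_events=["view_item", "purchase"], min_rows=2
)
print(problems)
```

The first three checklist items are design decisions and cannot be verified by code; they still need to be written down before feature engineering starts.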

Good machine learning starts before model training. It starts with a clear data design.